class: center, middle, inverse, title-slide # Extending Bayesian model averaging methodology for application across multiple unsupervised clustering methods
## Owen Forbes
July 7 2021
### Distinguished Professor Kerrie Mengersen
Dr Paul Wu
Queensland University of Technology --- class: list-space <style> .list-space li { padding: 0.25cm; } .list-nobullet li { list-style-type:none; } .center2 { margin: 0; position: absolute; top: 30%; left: 40%; -ms-transform: translate(-35%, -35%); transform: translate(-35%, -35%); } </style> --- background-image: url("data:image/png;base64,#imgs/LABS.jpg") background-position: right background-size: contain # Which clustering algorithm? --- .pull-left[<iframe src="imgs/plot_3d_km.html" width="620" height="320" scrolling="yes" seamless="seamless" frameBorder="0"> </iframe>] .pull-right[<iframe src="imgs/plot_3d_hc_colourfix.html" width="620" height="320" scrolling="yes" seamless="seamless" frameBorder="0"> </iframe>] .pull-left[<iframe src="imgs/plot_3d_gmm.html" width="620" height="320" scrolling="yes" seamless="seamless" frameBorder="0"> </iframe>] --- ## Overlapping clusters <iframe src="imgs/plot_3d_core_yellow.html" width="1200" height="600" scrolling="yes" seamless="seamless" frameBorder="0"> </iframe> --- ### Motivation - Bayesian Model Averaging for clustering - Different algorithms emphasise different aspects of clustering structure -- - Choosing a single 'best' model is often an arbitrary, unclear choice -- - **Inference not calibrated for model-based uncertainty** -- - Locking into one method loses the insights other methods offer about plausible clustering structure -- - BMA offers a nice framework for combining clustering solutions - simple - flexible - intuitive --- .pull-left[ #### *k*-means - 'Hard' clustering - Minimises within-cluster sums of squares] .pull-right[**k-means objective function** <br> `$$J = \sum_{i = 1}^K \sum_{x \in C_i}{\lvert \lvert x - c_i \rvert \rvert ^2}$$`] -- .pull-left[ #### Hierarchical Clustering (Ward's Method) - 'Hard' clustering - Each observation starts in its own cluster - Repeated pairwise fusion of the pair of clusters that minimises the increase in within-cluster sums of squares (Ward)] .pull-right[**Ward's objective function** <br> `$$D(c_1,c_2)
= \delta^2(c_1,c_2) = \frac{\lvert c_1 \rvert \lvert c_2 \rvert }{\lvert c_1 \rvert + \lvert c_2 \rvert} \lvert \lvert c_1 - c_2 \rvert \rvert ^2$$`] -- .pull-left[ #### Gaussian Mixture Model - 'Soft' clustering - Models data as coming from a mixture of Gaussian distributions - Restrictive distribution ('thin tails') - Tends to prioritise cluster compactness, and may 'overcluster'] .pull-right[**Mixture of multivariate Gaussians** <br> `$$p(x_n|\mu, \Sigma, \pi,K) = \sum_{k=1}^K{\pi_k\mathcal{N}(x_n|\mu_k, \Sigma_k)}$$`] --- ### Combining Clustering Results with Bayesian Model Averaging -- - Weighted averaging of results by model quality - Quantification of model-based uncertainty -- `$$P(\Delta | Y) = \sum_{l=1}^L P(\Delta | Y, M_l)P(M_l|Y)$$` -- - Limited development for clustering - Finite mixture models (Russell et al., 2015) - Naive Bayes classifiers (Santafe & Lozano, 2006) -- - No existing implementation across multiple clustering algorithms --- ### Extending Bayesian Model Averaging for Clustering Inference based on results from a single 'best' model ignores the uncertainty that arises from model selection -- - *M*-open scenario: the appropriate model class is not known in advance, but is determined iteratively as the analysis proceeds (Bernardo & Smith, 2009) -- `$$P(\Delta | Y) = \sum_{l=1}^L P(\Delta | Y, M_l)P(M_l|Y)$$` -- - For valid inference in the clustering context, `\(\Delta\)` should have the same meaning for all models and all `\(k\)`, and be invariant to labelling of `\(k\)` (Hoeting 2002) Where `$$P(M_l|Y) = \frac{P(Y|M_l)P(M_l)}{\sum_{l=1}^L P(Y|M_l)P(M_l)}$$` -- - Enables more robust inference through probabilistic combination of outputs from multiple models --- ### Extending Bayesian Model Averaging for Clustering Previous work (Russell et al., 2015) has implemented BMA for model averaging across finite mixture models -- - Each clustering model `\(M_l\)` represented with pairwise similarity matrix `\(S^l\)` - each element `\(s_{ij}\)`
represents the probability that data points `\(i\)` and `\(j\)` belong to the same cluster in that model - **invariant to labelling and number of clusters** -- <img src="data:image/png;base64,#imgs/similarity_matrices.jpg" width="2981" /> --- ### BMA for Mixture Models - BIC weighting - `\(P(Y|M_l)\)` typically involves a difficult/intractable integral and is often approximated in applications (Fragoso et al., 2018) -- - **Russell et al. (2015)** weight results from multiple GMMs according to BIC `$$P(M_l \lvert Y) \approx \frac{\exp(\frac{1}{2}{BIC}_l )}{\sum_{l=1}^L { \exp(\frac{1}{2} BIC_{l}) }}$$` -- - BIC definition for GMM `$${BIC}_l = 2\log(\mathcal{L}) - \kappa_l \log(N)$$` -- - GMM likelihood `$$\mathcal{L}(\Theta) = \prod_{n=1}^N \sum_{k=1}^K{\pi_k\mathcal{N}(x_n|\mu_k, \Sigma_k)}$$` -- - Multivariate Gaussian density `$$\mathcal{N}(x|\mu, \Sigma) = \frac{\exp\left\{ -\frac{1}{2} (x-\mu)^T \Sigma^{-1}(x-\mu)\right\}}{|\Sigma|^{\frac{1}{2}} (2 \pi)^{\frac{D}{2}}}$$` --- background-image: url("data:image/png;base64,#imgs/separation_compactness.png") background-position: right background-size: 600px ### Aside: Cluster internal validation indices .pull-left[.content-box-blue[ - Often used as a proxy for model quality in clustering {{content}} ] ] -- - Typically measure compactness and/or separation of clusters {{content}} -- - Choose between candidate models with different numbers of clusters `\(k\)` {{content}} -- - Interpreted similarly to model evidence - Agnostic to clustering algorithm - Most do not require a likelihood term --- background-image: url("data:image/png;base64,#imgs/separation_compactness.png") background-position: top right background-size: 300px ### New proposed weighting / approximation for posterior model probability BIC for GMM driven by **Multivariate Gaussian density:** `$$\mathcal{N}(x|\mu, \Sigma) = \frac{\exp\left\{ -\frac{1}{2} (x-\mu)^T \Sigma^{-1}(x-\mu)\right\}}{|\Sigma|^{\frac{1}{2}} (2 \pi)^{\frac{D}{2}}}$$` -- **Xie-Beni
index** - Ratio of compactness to separation (minimise) `$$XB = \frac{\sum_i{\sum_{x \in C_i}d^2(x,c_i)}}{n(\min_{i, j \neq i} d^2(c_i,c_j))}$$` -- **Calinski-Harabasz Index** - Ratio of separation to compactness (maximise) `$$CH = \frac{\sum_i{n_i d^2(c_i,c)/(NC-1)}}{\sum_i{\sum_{x \in C_i}d^2(x,c_i)/(n-NC)}}$$` -- - XB and CH have complementary strengths (Aggarwal & Reddy, 2012) --- ### New proposed weighting / approximation for posterior model probability <br> New proposed weight (rewarding high CH and low XB): `$$\mathcal{W}_m = \frac{CH_{m}}{\sum_{m=1}^M {CH_{m}}} + \frac{\frac{1}{XB_m}}{\sum_{m=1}^M {\frac{1}{XB_{m}}}}$$` Approximate posterior model probability for weighted averaging: `$$P(M_m \lvert Data) \approx \frac{\mathcal{W}_m}{\sum_{m=1}^M \mathcal{W}_m}$$` --- class: center, middle ## Case study: Clustering adolescents based on resting state EEG recordings --- ## Individual algorithms .pull-left[ k-means <img src="data:image/png;base64,#imgs/km_similarity.png" width="300" /> `$$\mathcal{W}_m = 0.36$$`] .pull-right[ Hierarchical Clustering <img src="data:image/png;base64,#imgs/hc_similarity.png" width="300" /> `$$\mathcal{W}_m = 0.27$$`] .pull-left[ Gaussian Mixture Model <img src="data:image/png;base64,#imgs/gmm_similarity.png" width="300" /> `$$\mathcal{W}_m = 0.37$$`] --- ## Consensus matrix <img src="data:image/png;base64,#imgs/eeg_consensus.jpg" width="800" /> --- ## Consensus dendrogram <img src="data:image/png;base64,#imgs/eeg_consensus_dendro.jpg" width="800" /> --- ## BMA Clusters: weighted mean of cluster membership probability (pairwise) <iframe src="imgs/eeg_3d_BMA_size_mean.html" width="1200" height="600" scrolling="yes" seamless="seamless" frameBorder="0"> </iframe> --- ## BMA Clusters: weighted standard deviation of cluster probability (pairwise) across models <iframe src="imgs/eeg_3d_BMA_size_sd.html" width="1200" height="600" scrolling="yes" seamless="seamless" frameBorder="0"> </iframe> - Uncertainty can be propagated forward for further analysis in a Bayesian
framework --- class: center, middle ## Small simulation study --- ## Simulated datasets - far - mid - close --- ## BMA solutions (left: size = probability; right: size = model-based uncertainty) --- ## Strengths 💪 -- - Don't need to wed ourselves arbitrarily to one 'best' method -- - Keep the insights offered by multiple solutions -- - Accepts 'hard' or 'soft' clustering solutions, from any algorithm, with any number of clusters -- - Intuitive probabilistic framework for combining and interpreting results -- - Approachable and easy to implement --- ### Next steps -- Extend simulation study - Investigate performance with more simulated clustering datasets in higher dimensions - Look at ability to down-weight poor solutions -- Case study - Propagate model-based uncertainty from BMA for cluster-based inferences -- Preprint coming to arXiv shortly -- Publish R package for BMA across multiple clustering algorithms --- ### Key Reference Russell, N., Murphy, T. B., & Raftery, A. E. (2015). Bayesian model averaging in model-based clustering and density estimation. arXiv preprint arXiv:1506.09035.
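--- ### Appendix: pipeline sketch

A minimal sketch of the pipeline described in these slides: fit k-means, Ward hierarchical clustering, and a GMM; represent each solution as a pairwise similarity matrix; weight models by normalised CH plus normalised inverse XB; and form the weighted consensus matrix. This is an illustrative Python/scikit-learn sketch on hypothetical toy data, not the authors' implementation (the planned package is in R); the helper function names and the Xie-Beni implementation are assumptions.

```python
# Illustrative sketch only -- assumes scikit-learn; not the authors' R package.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import calinski_harabasz_score

def one_hot(labels):
    """Hard labels -> (n, K) indicator matrix of cluster membership."""
    return np.eye(int(labels.max()) + 1)[labels]

def similarity_matrix(membership):
    """S^l with s_ij = probability that points i and j share a cluster."""
    return membership @ membership.T

def xie_beni(X, labels):
    """XB: within-cluster compactness over n * min squared centroid separation (minimise)."""
    ks = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
    within = sum(((X[labels == k] - c) ** 2).sum() for k, c in zip(ks, centroids))
    d2 = ((centroids[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    min_sep = d2[~np.eye(len(ks), dtype=bool)].min()
    return within / (len(X) * min_sep)

# Hypothetical toy data: three well-separated blobs in 3 dimensions
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, size=(50, 3)) for m in (0.0, 4.0, 8.0)])

labels = [
    KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X),
    GaussianMixture(n_components=3, random_state=0).fit_predict(X),
]

# Index-based weights: normalised CH (higher is better) + normalised 1/XB (lower is better)
ch = np.array([calinski_harabasz_score(X, l) for l in labels])
inv_xb = np.array([1.0 / xie_beni(X, l) for l in labels])
w = ch / ch.sum() + inv_xb / inv_xb.sum()
weights = w / w.sum()  # approximate posterior model probabilities

# Weighted consensus (averaged similarity) matrix across the three models
consensus = sum(wm * similarity_matrix(one_hot(l)) for wm, l in zip(weights, labels))
```

Because the similarity representation is invariant to cluster labels and to the number of clusters, the same consensus step works unchanged for soft clusterings: replace `one_hot(l)` with a matrix of membership probabilities (e.g. from `GaussianMixture.predict_proba`).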